PowerGrad
=================
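Computes the gradient of the Power operation.
For each input element, applies a linear transform and evaluates the gradient according to the following formula.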

.. math::

    output_i = power \times scale \times Input1_i \times (scale \times Input2_i + shift)^{(power - 1)}
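
The formula is consistent with the chain rule under one assumption that this page does not state explicitly: that the forward Power operation is :math:`y_i = (scale \times x_i + shift)^{power}`, with :math:`Input2_i` in the role of :math:`x_i` and :math:`Input1_i` acting as the factor propagated back from the following layer. Under that assumption, a short derivation reads:

.. math::

    \frac{\partial y_i}{\partial x_i} = power \times scale \times (scale \times x_i + shift)^{(power - 1)}

    output_i = Input1_i \times \frac{\partial y_i}{\partial x_i}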

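Inputs:
    - **Input1** - Address of the first input data; corresponds to the input of the forward pass.
    - **Input2** - Address of the second input data; corresponds to the input of the forward pass.
    - **length** - Input length.
    - **power** - Exponent.
    - **scale** - Scale factor.
    - **shift** - Shift (offset) value.
    - **core_mask (int, optional)** - Core mask (shared-storage version only).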

Outputs:
    - **output** - Address of the gradient result.

Supported platforms:
    ``FT78NE``  
    ``MT7004``  

.. note::
    - FT78NE supports fp32.
    - MT7004 supports fp32 and fp16.
    - If **scale** is 0, the output is always 0.
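
For checking results on a host machine, the formula can be written as a plain scalar loop. This is a minimal sketch, not the library's implementation: ``power_grad_ref`` is a hypothetical helper name, and the code assumes fp32 data and that **length** counts elements.

.. code-block:: c

    #include <math.h>

    /* Hypothetical host-side reference, not part of the library.
       Assumes fp32 data and that length is an element count. */
    static void power_grad_ref(const float *input1, const float *input2,
                               float *output, int length,
                               float power, float scale, float shift) {
        for (int i = 0; i < length; ++i) {
            /* Linear transform of the second input. */
            float base = scale * input2[i] + shift;
            /* power * scale * Input1_i * base^(power - 1) */
            output[i] = power * scale * input1[i] * powf(base, power - 1.0f);
        }
    }

With ``power = 2``, ``scale = 0.5`` and ``shift = 1``, the expression reduces to :math:`Input1_i \times (0.5 \times Input2_i + 1)`, so :math:`Input1_i = 3` and :math:`Input2_i = 4` give 9. With ``scale = 0`` the leading factor is zero, which matches the note above for finite operands.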

**Shared-storage version:**

.. c:function:: void fp_power_grad_s(float* Input1, float* Input2, float* output, int length, float power, float scale, float shift, int core_mask)
.. c:function:: void hp_power_grad_s(half* Input1, half* Input2, half* output, int length, float power, float scale, float shift, int core_mask)

    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 11

        // FT78NE example
        #include <stdio.h>

        int main(int argc, char* argv[]) {
            float *input1 = (float *)0xA0000000;   // input1 resides in DDR space
            float *input2 = (float *)0xA1000000;   // input2 resides in DDR space
            float *output = (float *)0xB0000000;   // output result resides in DDR space
            int length = 1024;
            float power = 2.0f, scale = 0.5f, shift = 1.0f;
            int core_mask = 0xff;
            fp_power_grad_s(input1, input2, output, length, power, scale, shift, core_mask);
            return 0;
        }


**Private-storage version:**

.. c:function:: void fp_power_grad_p(float* Input1, float* Input2, float* output, int length, float power, float scale, float shift)
.. c:function:: void hp_power_grad_p(half* Input1, half* Input2, half* output, int length, float power, float scale, float shift)

    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 10

        // MT7004 example
        #include <stdio.h>

        int main(int argc, char* argv[]) {
            float *input1 = (float *)0x10000000;  
            float *input2 = (float *)0x10001000;
            float *output = (float *)0x10002000;   
            int length = 1024;
            float power = 2.0f, scale = 0.5f, shift = 1.0f;
            fp_power_grad_p(input1, input2, output, length, power, scale, shift);
            return 0;
        }
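
One way to validate a result is to compare it element by element against a host-side evaluation of the formula, using a tolerance that scales with the magnitude of the expected value. The sketch below assumes fp32 data and that the output buffer is readable from the calling code. ``power_grad_matches`` is a hypothetical helper name, not a library function, and the tolerance is an arbitrary illustration.

.. code-block:: c

    #include <math.h>

    /* Hypothetical check, not part of the library: returns 1 if every element
       of output matches the formula within tol * max(1, |expected|), else 0. */
    static int power_grad_matches(const float *input1, const float *input2,
                                  const float *output, int length,
                                  float power, float scale, float shift,
                                  float tol) {
        for (int i = 0; i < length; ++i) {
            float base = scale * input2[i] + shift;
            float expected = power * scale * input1[i] * powf(base, power - 1.0f);
            float bound = tol * fmaxf(1.0f, fabsf(expected));
            /* The negated comparison also rejects NaN results. */
            if (!(fabsf(output[i] - expected) <= bound)) {
                return 0;
            }
        }
        return 1;
    }

A call such as ``power_grad_matches(input1, input2, output, length, power, scale, shift, 1e-5f)`` could follow ``fp_power_grad_p`` in the example above. The value ``1e-5f`` is a placeholder to be tuned per platform and data type.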